Forecasting plays a crucial role in airline management: airlines project demand to plan the supply of services that responds to it. Forecasts enable airlines to make informed decisions about pricing, seat inventory control, and catering. Decisions such as advertising and sales campaigns, aircraft scheduling, maintenance planning, and opening new sales offices all depend on predicted demand. Forecasts also guide the creation of new routes and the acquisition of new aircraft.
However, accurate forecasting of bookings is a challenging task where seasonality plays an important role. Airlines see a surge in demand during the summer holidays or the festive season compared to the rest of the year. Competition, economic circumstances, and external happenings such as weather conditions, natural disasters, or political unrest also affect flight bookings. Airline companies rely on accurate forecasting to offer competitive prices to attract customers and gain a larger market share.
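As a rough illustration of how such seasonality can be quantified (toy numbers, not this dataset), a simple seasonal index divides each calendar month's average demand by the overall mean:

```python
import pandas as pd

# Toy monthly demand with a summer peak (hypothetical numbers)
months = list(range(1, 13)) * 2
demand = [100, 95, 110, 112, 120, 135, 140, 138, 118, 115, 105, 108] * 2
s = pd.DataFrame({"month": months, "demand": demand})

# Seasonal index: average demand per calendar month relative to the overall mean
seasonal_index = s.groupby("month")["demand"].mean() / s["demand"].mean()
print(seasonal_index.round(2).loc[[1, 7]])  # January ≈ 0.86, July ≈ 1.20
```

An index above 1 flags high-season months; a seasonal forecasting model has to capture exactly this pattern.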
Several commonly used forecasting techniques can generate a probable data set that relates back to the original data.
The objective of this project is to identify the best forecasting model for predicting air travel demand. The selected dataset is public data available from data.world (original source: San Francisco Open Data). It is a San Francisco International Airport report on monthly passenger traffic statistics by airline from 2005 to 2016. Airport data is seasonal, so comparative analyses are performed on a period-over-period basis (such as March 2015 vs. March 2016) rather than period-to-period (such as March 2015 vs. April 2015). Notably, fact and attribute field relationships are not always 1-to-1. For example, Passenger Counts belonging to Southwest Airlines appear under multiple attribute fields and are additive, allowing us to derive categorical Passenger Counts for the analyses.
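Because the series is seasonal, comparisons should be year over year. A minimal sketch (with made-up monthly counts, not values from this dataset) shows the period-over-period change via a 12-month shift in pandas:

```python
import pandas as pd

# Toy monthly passenger counts for two years (hypothetical numbers, not this dataset)
idx = pd.period_range("2015-01", periods=24, freq="M")
counts = pd.Series([100 + i for i in range(24)], index=idx)

# Period-over-period comparison: March 2016 vs. March 2015 means a 12-month shift
yoy_change = counts.pct_change(periods=12, fill_method=None)
print(yoy_change["2016-03"])  # (114 - 102) / 102 ≈ 0.1176
```

The first 12 entries are NaN because they have no prior-year counterpart, which is exactly why period-to-period (1-month) changes would be misleading here.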
The forecasting model developed by this project can be used to predict air travel demand. Airline companies need to make informed decisions about pricing, capacity, and other areas of their business to optimize their operations and maximize their profitability.
We will use the five steps mentioned below to find the best forecasting model to predict the demand for air travel.
# Consolidated imports (duplicates removed)
%matplotlib inline
import math
from math import sin, sqrt, log, exp
import warnings
import itertools
from collections import defaultdict

import numpy as np
import numpy.random as nr
import pandas as pd
from pandas import Series, DataFrame, Grouper, read_csv
from pandas.plotting import autocorrelation_plot

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib import pyplot
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

import scipy as stat
from scipy.stats import t, norm, chi2, boxcox

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller
# statsmodels.tsa.arima_model.ARIMA was removed; use the current ARIMA class instead
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.graphics.gofplots import qqplot

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.preprocessing import PowerTransformer

from patsy import dmatrices
from IPython.display import display, HTML

warnings.filterwarnings("ignore")
display(HTML("<style>.container { width:100% !important; }</style>"))
air_travel = pd.read_csv('dataset/Air_Traffic_Passenger_Statistics.csv')
print('Number of Observations: ', len(air_travel))
display(air_travel.head())
Number of Observations: 15007
| | Activity Period | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count | Adjusted Activity Type Code | Adjusted Passenger Count | Year | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Deplaned | Low Fare | Terminal 1 | B | 27271 | Deplaned | 27271 | 2005 | July |
| 1 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Enplaned | Low Fare | Terminal 1 | B | 29131 | Enplaned | 29131 | 2005 | July |
| 2 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Thru / Transit | Low Fare | Terminal 1 | B | 5415 | Thru / Transit * 2 | 10830 | 2005 | July |
| 3 | 200507 | Air Canada | AC | Air Canada | AC | International | Canada | Deplaned | Other | Terminal 1 | B | 35156 | Deplaned | 35156 | 2005 | July |
| 4 | 200507 | Air Canada | AC | Air Canada | AC | International | Canada | Enplaned | Other | Terminal 1 | B | 34090 | Enplaned | 34090 | 2005 | July |
air_travel.describe()
| | Activity Period | Passenger Count | Adjusted Passenger Count | Year |
|---|---|---|---|---|
| count | 15007.000000 | 15007.000000 | 15007.000000 | 15007.000000 |
| mean | 201045.073366 | 29240.521090 | 29331.917105 | 2010.385220 |
| std | 313.336196 | 58319.509284 | 58284.182219 | 3.137589 |
| min | 200507.000000 | 1.000000 | 1.000000 | 2005.000000 |
| 25% | 200803.000000 | 5373.500000 | 5495.500000 | 2008.000000 |
| 50% | 201011.000000 | 9210.000000 | 9354.000000 | 2010.000000 |
| 75% | 201308.000000 | 21158.500000 | 21182.000000 | 2013.000000 |
| max | 201603.000000 | 659837.000000 | 659837.000000 | 2016.000000 |
We observed that United Airlines, the main carrier at SFO, is listed both as 'United Airlines' and as 'United Airlines - Pre 07/01/2013'. No reference explains the second label, so we relabel those cases as 'United Airlines' to analyze the airline consistently across the whole period.
air_travel.loc[air_travel["Operating Airline"].str.startswith('United Airlines'), "Operating Airline"] = 'United Airlines'
This dataset provides monthly passenger statistics that can be sliced under categories such as geographic regions, price, and activity type codes. We will format travel year and month to project the demand using various models.
month_num = {'January': '01', 'February': '02', 'March': '03', 'April': '04', 'May': '05', 'June': '06', 'July': '07',
'August': '08', 'September': '09', 'October': '10', 'November': '11', 'December': '12'}
air_travel['MM'] = air_travel['Month'].map(month_num)
air_travel['YY_MM']= (air_travel['Year'].astype(str) + "-" + air_travel['MM'].astype(str)).astype(str)
air_travel['YY_Month']= (air_travel['Year'].astype(str) + "-" + air_travel['Month']).astype(str)
air_travel['Activity_Period_Start_Date'] = pd.to_datetime(air_travel['Year'].astype(str) + air_travel['Month'], format='%Y%B')
air_travel.head()
| | Activity Period | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count | Adjusted Activity Type Code | Adjusted Passenger Count | Year | Month | MM | YY_MM | YY_Month | Activity_Period_Start_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Deplaned | Low Fare | Terminal 1 | B | 27271 | Deplaned | 27271 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 1 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Enplaned | Low Fare | Terminal 1 | B | 29131 | Enplaned | 29131 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 2 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Thru / Transit | Low Fare | Terminal 1 | B | 5415 | Thru / Transit * 2 | 10830 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 3 | 200507 | Air Canada | AC | Air Canada | AC | International | Canada | Deplaned | Other | Terminal 1 | B | 35156 | Deplaned | 35156 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 4 | 200507 | Air Canada | AC | Air Canada | AC | International | Canada | Enplaned | Other | Terminal 1 | B | 34090 | Enplaned | 34090 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
Specifically, adjusted passenger numbers for the Activity Type Code 'Enplaned' are filtered for the time series analysis.
air_travel_original = air_travel.copy()  # keep a full copy (all activity types) for later cross-sections
air_travel = air_travel.loc[air_travel['Activity Type Code'] == 'Enplaned']
We select the data for the activity type code 'Enplaned' and group by 'Year' to understand the data count per column and ensure that we have sufficient data points for each year.
air_travel.groupby('Year').count()
| Year | Activity Period | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count | Adjusted Activity Type Code | Adjusted Passenger Count | Month | MM | YY_MM | YY_Month | Activity_Period_Start_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 | 317 |
| 2006 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 | 624 |
| 2007 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 | 647 |
| 2008 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 | 653 |
| 2009 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 | 638 |
| 2010 | 636 | 636 | 633 | 636 | 633 | 636 | 636 | 636 | 636 | 636 | 636 | 636 | 636 | 636 | 636 | 636 | 636 | 636 | 636 |
| 2011 | 646 | 646 | 639 | 646 | 639 | 646 | 646 | 646 | 646 | 646 | 646 | 646 | 646 | 646 | 646 | 646 | 646 | 646 | 646 |
| 2012 | 650 | 650 | 642 | 650 | 642 | 650 | 650 | 650 | 650 | 650 | 650 | 650 | 650 | 650 | 650 | 650 | 650 | 650 | 650 |
| 2013 | 661 | 661 | 655 | 661 | 655 | 661 | 661 | 661 | 661 | 661 | 661 | 661 | 661 | 661 | 661 | 661 | 661 | 661 | 661 |
| 2014 | 659 | 659 | 658 | 659 | 658 | 659 | 659 | 659 | 659 | 659 | 659 | 659 | 659 | 659 | 659 | 659 | 659 | 659 | 659 |
| 2015 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 | 704 |
| 2016 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 | 181 |
time_bound_summary = air_travel[['Adjusted Passenger Count', 'Year', 'MM']].pivot_table(index='MM', columns='Year', aggfunc='sum', fill_value=0).T
time_bound_summary
| | Year | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adjusted Passenger Count | 2005 | 0 | 0 | 0 | 0 | 0 | 0 | 1569412 | 1553146 | 1363140 | 1384822 | 1304433 | 1350000 |
| | 2006 | 1182313 | 1094470 | 1337770 | 1371576 | 1408991 | 1570272 | 1583055 | 1539211 | 1353586 | 1417003 | 1325354 | 1374406 |
| | 2007 | 1217679 | 1144391 | 1405982 | 1420480 | 1521884 | 1650947 | 1656412 | 1687024 | 1471094 | 1568276 | 1465087 | 1477301 |
| | 2008 | 1297099 | 1294648 | 1560380 | 1504074 | 1653760 | 1738334 | 1761579 | 1774735 | 1499313 | 1569545 | 1383133 | 1491674 |
| | 2009 | 1290035 | 1173406 | 1454796 | 1500465 | 1592457 | 1733926 | 1798641 | 1790589 | 1602188 | 1633418 | 1490890 | 1550460 |
| | 2010 | 1359375 | 1249672 | 1541956 | 1559704 | 1687160 | 1836349 | 1846490 | 1845624 | 1683566 | 1752673 | 1585534 | 1591589 |
| | 2011 | 1409136 | 1295211 | 1561541 | 1593407 | 1771329 | 1900610 | 1932621 | 1923748 | 1783953 | 1804254 | 1670612 | 1742295 |
| | 2012 | 1571845 | 1493150 | 1739289 | 1764188 | 1915521 | 2078066 | 2103777 | 2165149 | 1890812 | 1934113 | 1744729 | 1747550 |
| | 2013 | 1558685 | 1478579 | 1794640 | 1774595 | 1988752 | 2092041 | 2060059 | 2154833 | 1884343 | 1957611 | 1756833 | 1916814 |
| | 2014 | 1675812 | 1533190 | 1879974 | 1916848 | 2078697 | 2180660 | 2195683 | 2248755 | 1948420 | 2034949 | 1830334 | 1941106 |
| | 2015 | 1736575 | 1611621 | 2000314 | 1992518 | 2173731 | 2309593 | 2351298 | 2356914 | 2099362 | 2216408 | 2017137 | 2089548 |
| | 2016 | 1825846 | 1753761 | 2070896 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Plot the overall demand series
dmd_series = air_travel[['YY_MM', 'Adjusted Passenger Count']].groupby('YY_MM').sum()
dmd_series.plot(figsize=(20, 5), title="Air Travel Demand Plot", xlabel="Month-Year", fontsize=12)
plt.show()
warnings.filterwarnings("ignore")
fig, ax = plt.subplots(figsize=(20, 5) )
train_pctg = 79
X = dmd_series
train_size = int(len(X) * train_pctg / 100)
train, test = X[0:train_size], X[train_size:len(X)]
print('Observations: %d' % (len(X)))
print('Training Observations: %d' % (len(train)))
print('Testing Observations: %d' % (len(test)))
plt.plot(train ,'o-', label='Training Data', linewidth=3)
plt.plot(test , 'o--', label='Testing Data', linewidth=2)
ax.set_title("Air Travel Demand Time Series", fontsize='15')
ax.set_xlabel("Demand Period - Year-Months" , fontsize=15)
ax.set_ylabel("Demand Quantity ")
# Rotating X-axis labels
ax.tick_params(axis='x', labelrotation = 50)
space = 6
ax.xaxis.set_major_locator(ticker.MultipleLocator(space))
plt.legend()
plt.show()
Observations: 129
Training Observations: 101
Testing Observations: 28
We excluded 'South America' region due to insufficient data points.
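A quick way to justify such an exclusion is to count observations per category. A minimal sketch on made-up counts (the real check would run on the air_travel frame):

```python
import pandas as pd

# Hypothetical record counts per region, not the real dataset
df = pd.DataFrame({
    "GEO Region": ["US"] * 50 + ["Europe"] * 30 + ["South America"] * 3,
})
region_counts = df["GEO Region"].value_counts()

# Flag regions with too few observations to support a train/test split
sparse_regions = region_counts[region_counts < 10].index.tolist()
print(sparse_regions)  # ['South America']
```

The threshold of 10 is arbitrary here; the point is that a region with only a handful of monthly points cannot support a meaningful train/test split.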
warnings.filterwarnings("ignore")
c = 0
region=air_travel[air_travel['GEO Region']!='South America']['GEO Region'].unique()
fig, ax = plt.subplots(nrows=len(region), ncols=1, figsize=(20, 40))
for s in region:
    region_series = air_travel[(air_travel['GEO Region'] == s) & (air_travel['Year'] < 2016)]
    region_time_data = region_series[['YY_MM', 'Adjusted Passenger Count']].groupby('YY_MM').sum()
    X = region_time_data
    train_size = int(len(X) * train_pctg / 100)
    train, test = X[0:train_size], X[train_size:len(X)]
    ax[c].plot(train, label='Training Data', linewidth=2)
    ax[c].plot(test, 'r--', label='Testing Data', linewidth=2)
    # Rotate x-axis labels
    ax[c].tick_params(axis='x', labelrotation=50)
    ax[c].xaxis.set_major_locator(ticker.MultipleLocator(space))
    ax[c].set_title("{} Region Demand Time Series".format(s), fontsize='12')
    ax[c].set_xlabel("Demand Period - Year-Months", fontsize=12)
    ax[c].set_ylabel("Demand Quantity")
    c += 1
# using padding
fig.tight_layout(pad=5.0)
plt.legend()
plt.show()
df_airline = (air_travel_original
              .groupby("Operating Airline")["Adjusted Passenger Count"]
              .sum()
              .sort_values(ascending=False)
              .head(20)
              .reset_index())
plt.figure(figsize = (15,7))
plt.title(" Adjusted Passenger Count by Operating Airline", fontsize=18)
plt.bar(df_airline["Operating Airline"], df_airline["Adjusted Passenger Count"], color='silver',
        edgecolor='black', linewidth=1)
plt.xlabel("Operating Airline",fontsize=15)
plt.ylabel("Adjusted Passenger Count",fontsize=15)
plt.xticks(fontsize=12, rotation=90)
plt.yticks(fontsize=12)
for k,v in df_airline["Adjusted Passenger Count"].items():
if v>300000:
plt.text(k,v-120000, str(v), fontsize=12,rotation=90,color='k', horizontalalignment='center');
else:
plt.text(k,v+15000, str(v), fontsize=12,rotation=90,color='k', horizontalalignment='center');
We use several categorical variables to build a summary plot with Plotly's Sankey diagram. We rank the top 20 airlines during 2015 and plot the passenger distribution by airline, travel type (domestic or international), GEO region, activity type (enplaned, deplaned, and transit), and fare type (low or other):
yr_end = 2015
top_n = 20
air_travel_f =air_travel_original.loc[(air_travel_original['Year'] == yr_end)]
air_travel_g = air_travel_f.groupby(['Operating Airline'])['Adjusted Passenger Count'].sum().reset_index()
top_n_vals= air_travel_g.nlargest(top_n, 'Adjusted Passenger Count')
air_travel_f = air_travel_f[air_travel_f['Operating Airline'].isin(top_n_vals['Operating Airline'])]
color_link = ['#000000', '#FFFF00', '#1CE6FF', '#FF34FF', '#FF4A46',
'#008941', '#006FA6', '#A30059','#FFDBE5', '#7A4900',
'#0000A6', '#63FFAC', '#B79762', '#004D43', '#8FB0FF',
'#997D87', '#5A0007', '#809693', '#FEFFE6', '#1B4400',
'#4FC601', '#3B5DFF', '#4A3B53', '#FF2F80', '#61615A',
'#BA0900', '#6B7900', '#00C2A0', '#FFAA92', '#FF90C9',
'#B903AA', '#D16100', '#DDEFFF', '#000035', '#7B4F4B',
'#A1C299', '#300018', '#0AA6D8', '#013349', '#00846F',
'#372101', '#FFB500', '#C2FFED', '#A079BF', '#CC0744',
'#C0B9B2', '#C2FF99', '#001E09', '#00489C', '#6F0062',
'#0CBD66', '#EEC3FF', '#456D75', '#B77B68', '#7A87A1',
'#788D66', '#885578', '#FAD09F', '#FF8A9A', '#D157A0',
'#BEC459', '#456648', '#0086ED', '#886F4C', '#34362D',
'#B4A8BD', '#00A6AA', '#452C2C', '#636375', '#A3C8C9',
'#FF913F', '#938A81', '#575329', '#00FECF', '#B05B6F',
'#8CD0FF', '#3B9700', '#04F757', '#C8A1A1', '#1E6E00',
'#7900D7', '#A77500', '#6367A9', '#A05837', '#6B002C',
'#772600', '#D790FF', '#9B9700', '#549E79', '#FFF69F',
'#201625', '#72418F', '#BC23FF', '#99ADC0', '#3A2465',
'#922329', '#5B4534', '#FDE8DC', '#404E55', '#0089A3',
'#CB7E98', '#A4E804', '#324E72', '#6A3A4C']
air_travel_g = air_travel_f.groupby(['Operating Airline', 'GEO Summary'])['Adjusted Passenger Count'].sum().reset_index()
src = air_travel_g['Operating Airline']
tgt = air_travel_g['GEO Summary']
vals = air_travel_g['Adjusted Passenger Count']
air_travel_g = air_travel_f.groupby(['Operating Airline', 'GEO Summary', 'GEO Region'])['Adjusted Passenger Count'].sum().reset_index()
src = pd.concat([src, air_travel_g['GEO Summary']])
tgt = pd.concat([tgt, air_travel_g['GEO Region']])
vals = pd.concat([vals, air_travel_g['Adjusted Passenger Count']])
air_travel_g = air_travel_f.groupby(['Operating Airline', 'GEO Summary', 'GEO Region', 'Adjusted Activity Type Code'])['Adjusted Passenger Count'].sum().reset_index()
src = pd.concat([src, air_travel_g['GEO Region']])
tgt = pd.concat([tgt, air_travel_g['Adjusted Activity Type Code']])
vals = pd.concat([vals, air_travel_g['Adjusted Passenger Count']])
air_travel_g = air_travel_f.groupby(['Operating Airline', 'GEO Summary', 'GEO Region', 'Adjusted Activity Type Code', 'Price Category Code'])['Adjusted Passenger Count'].sum().reset_index()
src = pd.concat([src, air_travel_g['Adjusted Activity Type Code']])
tgt = pd.concat([tgt, air_travel_g['Price Category Code']])
vals = pd.concat([vals, air_travel_g['Adjusted Passenger Count']])
node_label = pd.concat([src, tgt])
unq = set(node_label)
node_label = np.array(list(unq))
node_dict = {y:x for x, y in enumerate(node_label)}
source_node = [node_dict[x] for x in src]
target_node = [node_dict[x] for x in tgt]
# Recycle the palette so every link gets a color
link_colors = [color_link[i % len(color_link)] for i in range(len(vals))]
fig3 = go.Figure(
    data=[go.Sankey(
        # Node information
        node = dict(label=node_label, pad=35, thickness=20),
        # Link information
        link = dict(source=source_node, target=target_node, value=vals, color=link_colors)
    )])
fig3.update_layout(
hovermode='x',
title='Passengers Distribution During 2015',
font=dict(size=10, color='white'),
paper_bgcolor='#51504f'
)
fig3.show()
Let’s see the distribution of passengers by geo type (domestic vs. international):
fig = px.pie(air_travel_original, values='Adjusted Passenger Count', names='GEO Summary')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()
About 77% of the passenger traffic in the dataset was domestic.
Let’s use a treemap to see the distribution of passengers by airline within domestic and international traffic:
fig = px.treemap(air_travel_original.loc[(air_travel_original['Year'] == yr_end) & (air_travel_original['GEO Summary'] == 'Domestic')], path=[px.Constant("Domestic"), 'Operating Airline'],
values='Adjusted Passenger Count', color='Operating Airline', color_discrete_map={'(?)':'lightgrey'})
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.data[0].textinfo = 'label+text+value+percent parent'
fig.show()
fig = px.treemap(air_travel_original.loc[(air_travel_original['Year'] == yr_end) & (air_travel_original['GEO Summary'] == 'International')], path=[px.Constant("International"), 'Operating Airline'],
values='Adjusted Passenger Count', color='Operating Airline', color_discrete_map={'(?)':'lightgrey'})
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.data[0].textinfo = 'label+text+value+percent parent'
fig.show()
We can see that United Airlines is the major carrier in both domestic and international flights.
import plotly.graph_objects as go
yr_str = 2005
yr_end = 2016
air_travel_sum =air_travel_original.loc[((air_travel_original['Year'] > yr_str)
& (air_travel_original['Year'] < yr_end)
& (air_travel_original['GEO Summary'] == 'Domestic'))]
air_travel_sum = air_travel_sum.groupby(['Year'])['Adjusted Passenger Count'].sum().reset_index()
fig = go.Figure()
fig.add_trace(go.Scatter(
x=air_travel_sum['Year'], y=air_travel_sum['Adjusted Passenger Count'],
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color='rgb(184, 247, 212)'),
stackgroup='one',
name = 'Domestic'
))
air_travel_sum =air_travel_original.loc[((air_travel_original['Year'] > yr_str)
& (air_travel_original['Year'] < yr_end)
& (air_travel_original['GEO Summary'] == 'International'))]
air_travel_sum = air_travel_sum.groupby(['Year'])['Adjusted Passenger Count'].sum().reset_index()
fig.add_trace(go.Scatter(
x=air_travel_sum['Year'], y=air_travel_sum['Adjusted Passenger Count'],
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color='rgb(131, 90, 241)'),
name = 'International',
stackgroup='one' # define stack group
))
fig.update_layout()
fig.show()
We observe that there is a significant growth in the trend from 2007 up until 2012, with passenger numbers levelling off after that. This is applicable to both domestic and international travel. However, there is a larger demand for domestic air travel.
air_travel_sum =air_travel_original.loc[((air_travel_original['Operating Airline'] == 'KLM Royal Dutch Airlines')
| (air_travel_original['Operating Airline'] == 'American Airlines')
| (air_travel_original['Operating Airline'] == 'Delta Air Lines')
| (air_travel_original['Operating Airline'] == 'Southwest Airlines')
| (air_travel_original['Operating Airline'] == 'United Airlines'))]
air_travel_sum = air_travel_sum.groupby(['Operating Airline', 'Year', 'Month'])['Adjusted Passenger Count'].sum().reset_index()
month_names = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
fig = px.scatter(air_travel_sum, y="Year", x="Adjusted Passenger Count", color="Operating Airline", facet_col="Month",
labels={"Adjusted Passenger Count": "Passengers"},
category_orders={"Month": month_names})
fig.show()
Here is the monthly distribution of passengers for selected airlines during 2015:
air_travel_sum =air_travel_original.loc[((air_travel_original['Operating Airline'] == 'Southwest Airlines')
| (air_travel_original['Operating Airline'] == 'Delta Air Lines')
| (air_travel_original['Operating Airline'] == 'United Airlines'))]
air_travel_sum = air_travel_sum.groupby(['Operating Airline', 'Year', 'MM'])['Adjusted Passenger Count'].sum().reset_index()
import plotly.graph_objects as go
fig = go.Figure()
yr_end = 2015
fig.add_trace(go.Scatter(
x=month_names, y=air_travel_sum.loc[(air_travel_sum['Year'] == yr_end)
& (air_travel_sum['Operating Airline'] == 'Southwest Airlines')]['Adjusted Passenger Count'],
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color='rgb(184, 247, 212)'),
stackgroup='one',
name = 'Southwest Airlines'
))
fig.add_trace(go.Scatter(
x=month_names, y=air_travel_sum.loc[(air_travel_sum['Year'] == yr_end)
& (air_travel_sum['Operating Airline'] == 'Delta Air Lines')]['Adjusted Passenger Count'],
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color='rgb(111, 231, 219)'),
stackgroup='one',
name = 'Delta Air Lines'
))
fig.add_trace(go.Scatter(
x=month_names, y=air_travel_sum.loc[(air_travel_sum['Year'] == yr_end)
& (air_travel_sum['Operating Airline'] == 'United Airlines')]['Adjusted Passenger Count'],
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color='rgb(131, 90, 241)'),
name = 'United Airlines',
stackgroup='one' # define stack group
))
fig.update_layout()
fig.show()
We see that passenger numbers appear to be highest from approximately May through September, after which numbers dip for the rest of the year.
df_piv = air_travel_original[["Price Category Code","GEO Summary", "Adjusted Passenger Count"]]
pd.pivot_table(df_piv , values = 'Adjusted Passenger Count' , index = 'GEO Summary' , columns = 'Price Category Code' , aggfunc = 'sum')
| GEO Summary | Low Fare | Other |
|---|---|---|
| Domestic | 74527292 | 264515345 |
| International | 680815 | 100460628 |
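Summing this pivot gives the overall domestic share of adjusted passengers across the whole dataset:

```python
# Totals copied from the pivot table above (all years)
domestic = 74527292 + 264515345        # Low Fare + Other
international = 680815 + 100460628    # Low Fare + Other
domestic_share = domestic / (domestic + international)
print(f"{domestic_share:.1%}")  # 77.0%
```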
sns.countplot(data = air_travel_original, x = 'Price Category Code' , hue = 'GEO Summary' , palette = 'viridis')
<Axes: xlabel='Price Category Code', ylabel='count'>
The plot above shows that 'Low Fare' records occur almost entirely on domestic flights, while the 'Other' price category accounts for most passengers on international flights.
sns.countplot(data = air_travel_original , x = 'Price Category Code' , hue = 'GEO Region' , palette = 'viridis')
<Axes: xlabel='Price Category Code', ylabel='count'>
We see the highest air travel demand in the US region under the 'Other' price category code. To evaluate the forecasting models, we will select the 'Other' price category and the US region as sample data.
The main attributes available to slice the data are Price Category Code, GEO Region, GEO Summary, and Operating Airline. We will filter and create separate datasets for each attribute to use in the forecasting models. Because the visualizations above highlighted a few high-demand attribute combinations, we also create combined datasets to test the models on them. Forecasting on a focused dataset can help businesses improve projection accuracy across regions, increasing overall sales and profitability.
# Distinct attribute values used for slicing
price_codes = air_travel['Price Category Code'].unique()
region = air_travel[air_travel['GEO Region'] != 'South America']['GEO Region'].unique()
geo_summary = air_travel['GEO Summary'].unique()
print(price_codes)
print(region)
print(geo_summary)
['Low Fare' 'Other']
['US' 'Canada' 'Asia' 'Europe' 'Australia / Oceania' 'Mexico' 'Central America' 'Middle East']
['Domestic' 'International']
# by Price Category Code
low_fare_price_ds= air_travel[air_travel['Price Category Code']=='Low Fare']
other_price_ds= air_travel[air_travel['Price Category Code']=='Other']
# by GEO Summary
domestic_ds= air_travel[air_travel['GEO Summary']=='Domestic']
international_ds= air_travel[air_travel['GEO Summary']=='International']
# by GEO Region
us_ds= air_travel[air_travel['GEO Region']=='US']
canada_ds= air_travel[air_travel['GEO Region']=='Canada']
asia_ds= air_travel[air_travel['GEO Region']=='Asia']
europe_ds= air_travel[air_travel['GEO Region']=='Europe']
australia_ds= air_travel[air_travel['GEO Region']=='Australia / Oceania']
mexico_ds= air_travel[air_travel['GEO Region']=='Mexico']
central_america_ds= air_travel[air_travel['GEO Region']=='Central America']
middle_east_ds= air_travel[air_travel['GEO Region']=='Middle East']
# Create Combined datasets
us_otherPriceCode_ds= air_travel[ (air_travel['Price Category Code']=='Other') & (air_travel['GEO Region']=='US') ]
domestic_lowFarePriceCode_ds= air_travel[ (air_travel['Price Category Code']=='Low Fare') & (air_travel['GEO Summary']=='Domestic') ]
us_unitedAirline_ds= air_travel[ (air_travel['Operating Airline']=='United Airlines') & (air_travel['GEO Region']=='US') ]
# Create airline datasets
unitedAirline_ds= air_travel[(air_travel['Operating Airline'].str.startswith('United Airlines'))]
display(unitedAirline_ds.head())
americanAirline_ds= air_travel[air_travel['Operating Airline']=='American Airlines']
klm_ds= air_travel[air_travel['Operating Airline']=='KLM Royal Dutch Airlines']
| | Activity Period | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count | Adjusted Activity Type Code | Adjusted Passenger Count | Year | Month | MM | YY_MM | YY_Month | Activity_Period_Start_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | 200507 | United Airlines | UA | United Airlines | UA | Domestic | US | Enplaned | Other | Terminal 1 | B | 60939 | Enplaned | 60939 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 92 | 200507 | United Airlines | UA | United Airlines - Pre 07/01/2013 | UA | Domestic | US | Enplaned | Low Fare | Terminal 3 | E | 58277 | Enplaned | 58277 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 94 | 200507 | United Airlines | UA | United Airlines - Pre 07/01/2013 | UA | Domestic | US | Enplaned | Other | Terminal 3 | F | 421802 | Enplaned | 421802 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 97 | 200507 | United Airlines | UA | United Airlines - Pre 07/01/2013 | UA | International | Asia | Enplaned | Other | International | G | 68278 | Enplaned | 68278 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
| 100 | 200507 | United Airlines | UA | United Airlines - Pre 07/01/2013 | UA | International | Australia / Oceania | Enplaned | Other | International | G | 8234 | Enplaned | 8234 | 2005 | July | 07 | 2005-07 | 2005-July | 2005-07-01 |
We will create a function that returns three formatted datasets (full, training, and testing), which the functions ahead will use to fit the forecasting models and plot the forecast output.
######## Function to create training and testing datasets from a dataframe
def create_dataset(df, train_size_percentage):
    full_dataset = df[['YY_MM', 'Adjusted Passenger Count']].groupby('YY_MM').sum()
    train_size = round((train_size_percentage / 100) * len(full_dataset))
    train_dataset, test_dataset = full_dataset[0:train_size], full_dataset[train_size:len(full_dataset)]
    return full_dataset, train_dataset, test_dataset
# Create Formatted Dataset for Price Category Code
low_fare_full_df, low_fare_train_df , low_fare_test_df = create_dataset(low_fare_price_ds, train_pctg)
other_full_df, other_train_df , other_test_df = create_dataset(other_price_ds, train_pctg)
# Create Formatted Dataset for GEO Region
us_full_df, us_train_df , us_test_df = create_dataset(us_ds, train_pctg)
canada_full_df, canada_train_df , canada_test_df = create_dataset(canada_ds, train_pctg)
asia_full_df, asia_train_df , asia_test_df = create_dataset(asia_ds, train_pctg)
europe_full_df, europe_train_df , europe_test_df = create_dataset(europe_ds, train_pctg)
australia_full_df, australia_train_df , australia_test_df = create_dataset(australia_ds, train_pctg)
mexico_full_df, mexico_train_df , mexico_test_df = create_dataset(mexico_ds, train_pctg)
central_america_df, central_america_train_df , central_america_test_df = create_dataset(central_america_ds, train_pctg)
middle_east_df, middle_east_train_df , middle_east_test_df = create_dataset(middle_east_ds, train_pctg)
# Create Formatted Dataset for GEO Summary
domestic_full_df, domestic_train_df , domestic_test_df = create_dataset(domestic_ds, train_pctg)
international_full_df, international_train_df , international_test_df = create_dataset(international_ds, train_pctg)
# Create Formatted Dataset for Combined datasets
us_otherPriceCode_full_df, us_otherPriceCode_train_df , us_otherPriceCode_test_df = create_dataset(us_otherPriceCode_ds, train_pctg)
domestic_lowFarePriceCode_full_df, domestic_lowFarePriceCode_train_df , domestic_lowFarePriceCode_test_df = create_dataset(domestic_lowFarePriceCode_ds, 66)
us_unitedAirline_full_df, us_unitedAirline_train_df , us_unitedAirline_test_df = create_dataset(us_unitedAirline_ds, train_pctg)
# Create Formatted Dataset for Operating Airline
unitedAirline_full_df, unitedAirline_train_df , unitedAirline_test_df = create_dataset(unitedAirline_ds, train_pctg)
americanAirline_full_df, americanAirline_train_df , americanAirline_test_df = create_dataset(americanAirline_ds, train_pctg)
klm_full_df, klm_train_df , klm_test_df = create_dataset(klm_ds, train_pctg)
####### Function to create yearly Training and Testing datasets from a dataframe
def create_yearly_dataset(df, train_size_percentage):
    full_dataset = df[['Year', 'Adjusted Passenger Count']].groupby('Year').sum()
    train_size = round((train_size_percentage / 100) * len(full_dataset))
    test_size = len(full_dataset) - train_size
    train_dataset, test_dataset = full_dataset[0:train_size], full_dataset[train_size:]
    return full_dataset, train_dataset, test_dataset
low_fare_yearly_full_df, low_fare_yearly_train_df , low_fare_yearly_test_df = create_yearly_dataset(low_fare_price_ds, train_pctg)
other_yearly_full_df, other_yearly_train_df , other_yearly_test_df = create_yearly_dataset(other_price_ds, train_pctg)
From Price Category Code: Other price category code
From GEO Region: US
From GEO Summary: International
From Combined attributes:
Domestic - Low fare price category code
From Operating Airline:
KLM
United Airline
American Airline
Summary statistics provide a concise description of the sample data, including the mean, median, mode, minimum, maximum, range, and standard deviation. Below are the summary statistics for the KLM and United Airlines datasets.
klm_full_df.describe()
| Adjusted Passenger Count | |
|---|---|
| count | 129.000000 |
| mean | 9062.155039 |
| std | 2584.274473 |
| min | 4421.000000 |
| 25% | 6542.000000 |
| 50% | 9513.000000 |
| 75% | 11633.000000 |
| max | 12888.000000 |
unitedAirline_full_df.describe()
| Adjusted Passenger Count | |
|---|---|
| count | 129.000000 |
| mean | 654412.736434 |
| std | 92823.320734 |
| min | 453529.000000 |
| 25% | 591246.000000 |
| 50% | 649946.000000 |
| 75% | 715628.000000 |
| max | 874616.000000 |
Graphical visualizations help us understand the trend, seasonality, and residual components in the dataset. This information is required to select a suitable forecasting model. We will create a function that plots, for each dataset, a series plot, a histogram, a density plot, and a yearly boxplot:
def plot_dataseries(df, name):
    fig, ax = plt.subplots(1, 4, figsize=(25, 4))
    df = df.reset_index()
    # build a year column for the yearly boxplot
    df_box = df.copy()
    df_box['year'] = df_box['YY_MM'].astype(str).str[:4]
    df_box = df_box[['year', 'Adjusted Passenger Count']]
    series = df['Adjusted Passenger Count']
    plt.subplot(1, 4, 1)
    series.plot(title="Series Plot for " + name)
    plt.subplot(1, 4, 2)
    series.hist()
    plt.title("Histogram for " + name)
    plt.subplot(1, 4, 3)
    series.plot(kind='kde', title="Density Plot for " + name)
    plt.subplot(1, 4, 4)
    sns.boxplot(x=df_box['year'], y=df_box['Adjusted Passenger Count'], palette="Blues", width=0.3)
    plt.tick_params(axis='x', labelrotation=50)  # rotate x-axis labels
    plt.title("Boxplot for " + name)
    plt.show()
def plot_detrend_series(df, dataset_name):
    fig, axs = pyplot.subplots(2, 3, figsize=(25, 6))
    series = df
    X = [i for i in range(0, len(df))]
    X = np.reshape(X, (len(X), 1))
    # Box-Cox with lambda = 0.0 is a log transformation
    y = boxcox(df['Adjusted Passenger Count'], 0.0)
    model = LinearRegression()
    model.fit(X, y)
    trend = model.predict(X)
    # plot trend
    pyplot.subplot(2, 3, 1)
    pyplot.plot(y)
    pyplot.plot(trend)
    pyplot.title("Log Transform Trend Plot for " + dataset_name)
    trend_df = DataFrame(trend)
    # detrend by subtracting the fitted linear trend
    detrended = [y[i] - trend[i] for i in range(0, len(series))]
    # plot detrended series
    pyplot.subplot(2, 3, 2)
    pyplot.plot(detrended)
    pyplot.title("De-Trend Plot for " + dataset_name)
    residuals = DataFrame(detrended)
    pyplot.subplot(2, 3, 3)
    pyplot.hist(residuals)
    pyplot.title("Histogram of De-Trend Series for " + dataset_name)
    # density plot
    pyplot.subplot(2, 3, 4)
    sns.kdeplot(detrended)
    pyplot.title("Density Plot of De-Trend Series for " + dataset_name)
    pyplot.subplot(2, 3, 5)
    qqplot(residuals, line='r', ax=pyplot.gca())
    pyplot.title("QQ Plot of De-Trend Series for " + dataset_name)
    pyplot.subplot(2, 3, 6)
    autocorrelation_plot(residuals, ax=pyplot.gca())
    pyplot.title("Autocorrelation of De-Trend Series for " + dataset_name)
    pyplot.show()
    return residuals
plot_dataseries(klm_full_df,'KLM')
print("-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
series_detrend=plot_detrend_series( klm_full_df, "KLM")
plot_dataseries(unitedAirline_full_df,'United Airline')
print("-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
series_detrend=plot_detrend_series( unitedAirline_full_df, "United Airline")
We will evaluate the following forecasting models: Linear Regression with Box-Cox transform, ARIMA, Autoregression, SARIMAX, and Prophet.
def print_plot_autocorelation(df, months_in_year, dataset_name):
    from pandas import Series

    # seasonal differencing at the given interval (12 for monthly data)
    def difference(dataset, interval=1):
        diff = list()
        for i in range(interval, len(dataset)):
            value = dataset[i] - dataset[i - interval]
            diff.append(value)
        return Series(diff)

    X = pd.Series(df['Adjusted Passenger Count']).values
    X = X.astype('float')
    # difference the data at the seasonal lag
    stationary = difference(X, months_in_year)
    stationary.index = df.index[months_in_year:]
    # check stationarity with the Augmented Dickey-Fuller test
    result = adfuller(stationary)
    print("ADF Statistic for :" + dataset_name)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))
    # plot the stationary (differenced) series with ACF and PACF
    pyplot.figure(figsize=(20, 4))
    pyplot.subplot(131)
    stationary.plot(title="Stationary Series Plot for " + dataset_name)
    plt.tick_params(axis='x', labelrotation=50)  # rotate x-axis labels
    pyplot.subplot(132)
    plot_acf(stationary, lags=11, ax=pyplot.gca(), title="Autocorrelation plot for :" + dataset_name)
    pyplot.subplot(133)
    plot_pacf(stationary, lags=11, ax=pyplot.gca(), title="Partial Autocorrelation plot for :" + dataset_name)
    pyplot.show()
print_plot_autocorelation(klm_train_df,12, "KLM")
ADF Statistic for :KLM
ADF Statistic: -2.525378
p-value: 0.109421
Critical Values:
	1%: -3.518
	5%: -2.900
	10%: -2.587
print_plot_autocorelation(unitedAirline_train_df,12, "United Airlines")
ADF Statistic for :United Airlines
ADF Statistic: -2.260012
p-value: 0.185189
Critical Values:
	1%: -3.506
	5%: -2.895
	10%: -2.584
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_squared_error
warnings.filterwarnings("ignore")
import matplotlib.ticker as ticker
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
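For reference, the RMSE, MAE, and MAPE values reported throughout can be reproduced by hand. A minimal stdlib sketch on invented numbers, following the same definitions as the sklearn metrics used here (note sklearn's `mean_absolute_percentage_error` returns a fraction, not a percentage):

```python
from math import sqrt

# Toy observed/predicted pairs, invented for illustration only
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 360.0]
n = len(actual)

# Root mean squared error
rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
# Mean absolute error
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
# Mean absolute percentage error, as a fraction (sklearn convention)
mape = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n

print(round(rmse, 3), round(mae, 3), round(mape, 4))
```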
def boxcox_inverse(value, lam):
    # invert the Box-Cox transform; lambda of 0 corresponds to the log transform
    if lam == 0:
        return exp(value)
    return exp(log(lam * value + 1) / lam)
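As a sanity check, `boxcox_inverse` should undo the one-parameter Box-Cox transform. The sketch below writes out the forward transform itself (so it needs no scipy) and round-trips an arbitrary value at a few lambdas; it is illustrative only, not part of the project code:

```python
from math import exp, log

def boxcox_forward(x, lam):
    # the standard one-parameter Box-Cox transform
    if lam == 0:
        return log(x)
    return (x ** lam - 1) / lam

def boxcox_inverse(value, lam):
    # same inverse as defined above
    if lam == 0:
        return exp(value)
    return exp(log(lam * value + 1) / lam)

# round-trip check at a few lambdas, on an arbitrary value
for lam in (0.0, 0.5, 1.2):
    v = boxcox_forward(250.0, lam)
    assert abs(boxcox_inverse(v, lam) - 250.0) < 1e-9
print("round-trip ok")
```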
def calculate_boxcox_linreg(df, dataset_name, train_percent):
    series = df
    X = DataFrame(series.values)
    X.columns = ['Adjusted Passenger Count']
    X['Adjusted Passenger Count'], lam = boxcox(X['Adjusted Passenger Count'])
    print('BoxCox lambda: %f' % lam)
    pyplot.figure(figsize=(20, 3))
    # line plot
    pyplot.subplot(131)
    pyplot.plot(X['Adjusted Passenger Count'])
    pyplot.title("BoxCox Series Plot :" + dataset_name)
    # histogram
    pyplot.subplot(132)
    pyplot.hist(X['Adjusted Passenger Count'])
    pyplot.title("BoxCox Series Histogram Plot :" + dataset_name)
    # QQ plot
    pyplot.subplot(133)
    qqplot(X['Adjusted Passenger Count'], line='r', ax=pyplot.gca())
    pyplot.title("BoxCox Series QQ Plot :" + dataset_name)
    pyplot.show()
    ###################################################################
    X = pd.Series(df['Adjusted Passenger Count']).values
    X = X.astype('float32')
    train_size = int(len(X) * train_percent / 100)
    train, test = X[0:train_size], X[train_size:]
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        transformed, lam = boxcox(history)
        if lam < -5:
            transformed, lam = history, 1
        # fit a degree-2 polynomial trend on the transformed history
        fit = np.polyfit(np.arange(len(transformed)), transformed, deg=2)
        fit_function = np.poly1d(fit)
        prediction = fit_function(len(train) + i)
        yhat = boxcox_inverse(prediction, lam)
        predictions.append(yhat)
        # observation
        obs = test[i]
        #history.append(obs)
        print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))
    # report performance
    rmse = sqrt(mean_squared_error(test, predictions))
    print("For " + dataset_name + " Dataset : " + 'Test RMSE: %.3f' % rmse)
    mae = mean_absolute_error(test, predictions)
    print("For " + dataset_name + " Dataset : " + 'Test MAE: %.3f' % mae)
    mape = mean_absolute_percentage_error(test, predictions)
    print("For " + dataset_name + " Dataset : " + 'Test MAPE: %.3f' % mape)
    # back-transform the fitted trend over the training period for plotting
    train_linreg = list()
    for i in range(len(train)):
        linreg = fit_function(i)
        train_linreg.append(boxcox_inverse(linreg, lam))
    pyplot.figure(figsize=(12, 5), dpi=100)
    print('BoxCox Lambda of Transformation used: %.3f' % lam)
    pyplot.plot(train, label='Training Data', linewidth=2)
    pyplot.plot(train_linreg, 'r', label='Train_LinReg', linewidth=2)
    pyplot.plot([None for i in train] + [x for x in test], 'b', label='Testing Data', linewidth=2)
    pyplot.plot([None for i in train] + [x for x in predictions], 'r--', label='Forecast Data', linewidth=2)
    pyplot.title("Linear Regression with BoxCox Transform \n Observed Vs Predicted Plot for " + dataset_name)
    pyplot.legend(loc='upper left', fontsize=8)
    pyplot.show()
    model_summary.loc[('Linear Regression(BoxCox)', 'RMSE'), dataset_name] = rmse
    model_summary.loc[('Linear Regression(BoxCox)', 'MAE'), dataset_name] = mae
    model_summary.loc[('Linear Regression(BoxCox)', 'MAPE'), dataset_name] = mape
calculate_boxcox_linreg(klm_full_df, "KLM", train_pctg)
BoxCox lambda: 1.163519
>Predicted=9050.380, Expected=6256.000
>Predicted=9026.256, Expected=5244.000
>Predicted=9001.523, Expected=4695.000
...
>Predicted=8221.137, Expected=5338.000
>Predicted=8180.626, Expected=7726.000
For KLM Dataset : Test RMSE: 2685.737
For KLM Dataset : Test MAE: 2542.532
For KLM Dataset : Test MAPE: 0.330
BoxCox Lambda of Transformation used: 1.261
calculate_boxcox_linreg(unitedAirline_full_df, "United Airline", train_pctg)
BoxCox lambda: 0.144278
>Predicted=753981.198, Expected=727154.000
>Predicted=760119.839, Expected=644051.000
>Predicted=766393.839, Expected=591906.000
...
>Predicted=962740.360, Expected=615685.000
>Predicted=973073.097, Expected=752729.000
For United Airline Dataset : Test RMSE: 152895.434
For United Airline Dataset : Test MAE: 122828.409
For United Airline Dataset : Test MAPE: 0.178
BoxCox Lambda of Transformation used: 0.334
Define a function to search ARIMA hyperparameters: grid-search all combinations of p, d, and q over a set of integer values (skipping those that fail to converge) and return the combination that performs best on the test dataset.
import warnings
from math import sqrt
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
from sklearn.metrics import mean_squared_error
warnings.filterwarnings("ignore")
# evaluate an ARIMA model for a given order (p,d,q)
def evaluate_arima_model(X, arima_order, train_percent):
    # prepare training dataset
    train_size = int(len(X) * train_percent / 100)
    train, test = X[0:train_size], X[train_size:]
    history = [x for x in train]
    # walk-forward validation over the test period
    predictions = list()
    for t in range(len(test)):
        model = ARIMA(history, order=arima_order)
        model_fit = model.fit()
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        history.append(test[t])
    # calculate out-of-sample forecast error
    rmse = sqrt(mean_squared_error(test, predictions))
    return rmse
# evaluate combinations of p, d and q values for an ARIMA model
def evaluate_models(dataset, p_values, d_values, q_values, dataset_name, train_percent):
    dataset = dataset.astype('float32')
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p, d, q)
                try:
                    rmse = evaluate_arima_model(dataset, order, train_percent)
                    if rmse < best_score:
                        best_score, best_cfg = rmse, order
                    #print('ARIMA%s RMSE=%.3f' % (order, rmse))
                except:
                    continue
    print("For " + dataset_name + ' Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))
    return best_score, best_cfg
p_values = [0, 1, 2, 4]
d_values = range(0, 3)
q_values = range(0, 3)
warnings.filterwarnings("ignore")
klm_rmse, klm_arima_order = evaluate_models(klm_full_df.values, p_values, d_values, q_values,"KLM", train_pctg)
For KLM Best ARIMA(4, 0, 2) RMSE=1089.247
Printing of intermediate results for each ARIMA parameter combination was commented out; only the best ARIMA parameters are printed, to save space.
p_values = [0, 1, 2, 4]
d_values = range(0, 3)
q_values = range(0, 3)
warnings.filterwarnings("ignore")
unitedAirline_rmse, unitedAirline_arima_order = evaluate_models(unitedAirline_full_df.values, p_values, d_values, q_values,"United Airline", train_pctg)
For United Airline Best ARIMA(4, 0, 2) RMSE=61138.822
Define a function that fits an ARIMA model with walk-forward validation, calculates predicted values over the testing period, and plots expected vs. predicted values.
from matplotlib import pyplot
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from math import sqrt
def plot_ARIMA(df, train_percent, arima_order, dataset_name):
    X = df
    train_size = int(len(X) * train_percent / 100)
    train, test = X[0:train_size], X[train_size:len(X)]
    history = [x for x in train.values]
    predictions = list()
    # walk-forward validation
    for t in range(0, len(test)):
        model = ARIMA(history, order=arima_order)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        obs = test.values[t]
        history.append(obs)
        print('predicted=%f, expected=%f' % (yhat, obs))
    # evaluate forecasts
    rmse = sqrt(mean_squared_error(test, predictions))
    print("For " + dataset_name + " Dataset : " + 'Test RMSE: %.3f' % rmse)
    mae = mean_absolute_error(test, predictions)
    print("For " + dataset_name + " Dataset : " + 'Test MAE: %.3f' % mae)
    mape = mean_absolute_percentage_error(test, predictions)
    print("For " + dataset_name + " Dataset : " + 'Test MAPE: %.3f' % mape)
    # plot forecasts against actual outcomes
    fig, ax = plt.subplots(figsize=(20, 5))
    ax.tick_params(axis='x', labelrotation=50)  # rotate x-axis labels
    space = 6
    ax.xaxis.set_major_locator(ticker.MultipleLocator(space))
    pyplot.plot(train, label='Training Data', linewidth=2)
    pyplot.plot(test, 'b', label='Testing Data', linewidth=2)
    pyplot.plot(pd.DataFrame(predictions, test.index), 'r--', label='Forecast Data', linewidth=2)
    pyplot.title("ARIMA Forecast \n Observed Vs Predicted Plot for " + dataset_name)
    pyplot.legend(loc='upper left', fontsize=8)
    pyplot.show()
    model_summary.loc[('ARIMA', 'RMSE'), dataset_name] = rmse
    model_summary.loc[('ARIMA', 'MAE'), dataset_name] = mae
    model_summary.loc[('ARIMA', 'MAPE'), dataset_name] = mape
plot_ARIMA(klm_full_df, train_pctg, (klm_arima_order),"KLM" )
predicted=5297.472540, expected=6256.000000
predicted=5673.101081, expected=5244.000000
predicted=5672.259565, expected=4695.000000
...
predicted=6633.196530, expected=5338.000000
predicted=7342.435504, expected=7726.000000
For KLM Dataset : Test RMSE: 1089.247
For KLM Dataset : Test MAE: 936.701
For KLM Dataset : Test MAPE: 0.119
plot_ARIMA(unitedAirline_full_df, train_pctg, (unitedAirline_arima_order),"United Airline" )
predicted=627664.046981, expected=727154.000000
predicted=716906.066515, expected=644051.000000
predicted=625576.128550, expected=591906.000000
...
predicted=590729.603371, expected=615685.000000
predicted=655440.676462, expected=752729.000000
For United Airline Dataset : Test RMSE: 61138.822
For United Airline Dataset : Test MAE: 48985.629
For United Airline Dataset : Test MAPE: 0.067
Autoregression is a linear regression model that uses lagged values of the series as inputs. The statsmodels library provides an AutoReg class that is instantiated with a lag parameter. After creating the model, we call the fit() method to train it on our dataset, which returns an AutoRegResults object. Once fit, we can use the model to predict future values by calling the predict() function.
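Before using AutoReg, the underlying idea can be illustrated directly: an AR(k) model is ordinary least squares on a lag matrix. The toy series and lag order below are invented for the example and are not the project data:

```python
import numpy as np

# Illustrative only: fit an AR(2) model by ordinary least squares on a
# toy series, mirroring the idea that autoregression is linear regression
# on lagged values.
series = np.array([112., 118., 132., 129., 121., 135.,
                   148., 148., 136., 119., 104., 118.])
lags = 2

# Build the lag matrix: each row is [1, y[t-1], y[t-2]] predicting y[t]
rows = [[1.0] + [series[t - k] for k in range(1, lags + 1)]
        for t in range(lags, len(series))]
X = np.array(rows)
y = series[lags:]

# Least-squares estimate of the intercept and lag coefficients
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead forecast from the last `lags` observations
next_val = coef[0] + coef[1] * series[-1] + coef[2] * series[-2]
print(coef.shape, float(next_val))
```

This mirrors the walk-forward loop below, which applies the fitted AutoReg coefficients to the most recent 12 observations at each step.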
def plot_autoreg(df, train_percent, dataset_name):
    from statsmodels.tsa.ar_model import AutoReg
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import mean_absolute_percentage_error
    from math import sqrt
    X = df
    train_size = int(len(X) * train_percent / 100)
    train, test = X[0:train_size], X[train_size:len(X)]
    # train the autoregression model
    window = 12
    model = AutoReg(train.values, lags=window)
    model_fit = model.fit()
    print("Autoregression Forecast for " + dataset_name)
    #print(model_fit.summary())
    coef = model_fit.params
    # walk forward over the time steps in the test set
    history = train.values[len(train) - window:]
    history = [history[i] for i in range(len(history))]
    predictions = list()
    for t in range(len(test)):
        length = len(history)
        lag = [history[i] for i in range(length - window, length)]
        # apply the learned coefficients to the lagged observations
        yhat = coef[0]
        for d in range(window):
            yhat += coef[d + 1] * lag[window - d - 1]
        obs = test.values[t]
        predictions.append(yhat)
        history.append(obs)
        print('predicted=%f, expected=%f' % (yhat, obs))
    # print forecasting errors for evaluation
    rmse = sqrt(mean_squared_error(test, predictions))
    print("For " + dataset_name + " Dataset : " + 'Test RMSE: %.3f' % rmse)
    mae = mean_absolute_error(test, predictions)
    print("For " + dataset_name + " Dataset : " + 'Test MAE: %.3f' % mae)
    mape = mean_absolute_percentage_error(test, predictions)
    print("For " + dataset_name + " Dataset : " + 'Test MAPE: %.3f' % mape)
    # plot forecasts against actual outcomes
    fig, ax = plt.subplots(figsize=(20, 5))
    ax.tick_params(axis='x', labelrotation=50)  # rotate x-axis labels
    space = 6
    ax.xaxis.set_major_locator(ticker.MultipleLocator(space))
    pyplot.plot(train, label='Training Data', linewidth=2)
    pyplot.plot(test, 'b', label='Testing Data', linewidth=2)
    pyplot.plot(pd.DataFrame(predictions, test.index), 'r--', label='Forecast Data', linewidth=2)
    pyplot.title("Autoregression Forecast \n Observed Vs Predicted Plot for " + dataset_name)
    pyplot.legend(loc='upper left', fontsize=8)
    pyplot.show()
    model_summary.loc[('Auto Regression', 'RMSE'), dataset_name] = rmse
    model_summary.loc[('Auto Regression', 'MAE'), dataset_name] = mae
    model_summary.loc[('Auto Regression', 'MAPE'), dataset_name] = mape
plot_autoreg(klm_full_df , train_pctg, "KLM")
Autoregression Forecast for KLM
predicted=5390.621050, expected=6256.000000
predicted=5464.292496, expected=5244.000000
predicted=4983.271111, expected=4695.000000
...
predicted=5436.517374, expected=5338.000000
predicted=6170.904480, expected=7726.000000
For KLM Dataset : Test RMSE: 777.294
For KLM Dataset : Test MAE: 638.596
For KLM Dataset : Test MAPE: 0.080
plot_autoreg(unitedAirline_full_df , train_pctg, "United Airline")
Autoregression Forecast for United Airline
predicted=663624.602202, expected=727154.000000
predicted=675266.822014, expected=644051.000000
predicted=622358.328467, expected=591906.000000
...
predicted=637739.754874, expected=615685.000000
predicted=721062.574319, expected=752729.000000
For United Airline Dataset : Test RMSE: 34771.434
For United Airline Dataset : Test MAE: 28288.749
For United Airline Dataset : Test MAPE: 0.040
def plot_seasonal_decomposition(df, dataset_name):
    from pylab import rcParams
    import statsmodels.api as sm
    df = df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index()
    df = df.set_index('Activity_Period_Start_Date')
    y = df['Adjusted Passenger Count'].resample('MS').mean()
    print("Decomposition Plots for ", dataset_name)
    rcParams['figure.figsize'] = 12, 8
    decomposition = sm.tsa.seasonal_decompose(y, model='additive')
    fig = decomposition.plot()
    plt.show()
plot_seasonal_decomposition(klm_ds,'KLM')
Decomposition Plots for KLM
plot_seasonal_decomposition(unitedAirline_ds,'United Airline')
Decomposition Plots for United Airline
def compute_SARIMAX_param(df, dataset_name):
    df = df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index()
    df = df.set_index('Activity_Period_Start_Date')
    y = df['Adjusted Passenger Count'].resample('MS').mean()
    # grid of non-seasonal (p,d,q) and seasonal (P,D,Q,12) parameters
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
    best_aic, best_cfg, best_scfg = float("inf"), None, None
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(y,
                                                order=param,
                                                seasonal_order=param_seasonal,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
                results = mod.fit()
                #print('ARIMA{}x{}12 - AIC:{} - BIC:{}'.format(param, param_seasonal, results.aic, results.bic))
                if results.aic < best_aic:
                    best_aic, best_cfg, best_scfg = results.aic, param, param_seasonal
            except:
                continue
    print("For " + dataset_name + ': Best AIC:{} SARIMAX: {} x {}'.format(best_aic, best_cfg, best_scfg))
    # return the best configuration found, not the last one tried
    return best_aic, best_cfg, best_scfg
klm_aic , klm_param , klm_param_seasonal = compute_SARIMAX_param(klm_ds , 'KLM' )
For KLM: Best AIC:1648.5400930612086 SARIMAX: (1, 1, 1) x (0, 1, 1, 12)
unitedAirline_aic , unitedAirline_param , unitedAirline_param_seasonal = compute_SARIMAX_param(unitedAirline_ds , 'United Airline' )
For United Airline: Best AIC:2326.082245597009 SARIMAX: (0, 1, 1) x (0, 1, 1, 12)
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_squared_error
def plot_SARIMAX_FCST_model_summary(df, arima_order, sea_order, dataset_name):
    df = df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index()
    df = df.set_index('Activity_Period_Start_Date')
    y = df['Adjusted Passenger Count'].resample('MS').mean()
    mod = sm.tsa.statespace.SARIMAX(y,
                                    order=arima_order,
                                    seasonal_order=sea_order,
                                    #enforce_stationarity=False,
                                    enforce_invertibility=False)
    results = mod.fit()
    print("SARIMAX Model Summary for " + dataset_name)
    print(results.summary().tables[1])
    results.plot_diagnostics(figsize=(16, 8))
    plt.show()
    # one-step-ahead predictions over the validation period
    pred = results.get_prediction(start=pd.to_datetime('2015-01-01'), dynamic=False)
    pred_ci = pred.conf_int()
    ax = y['2005':].plot(label='observed')
    pred.predicted_mean.plot(ax=ax, label='SARIMAX Forecast', alpha=.7, figsize=(14, 7))
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.2)
    ax.set_xlabel('Date')
    ax.set_ylabel(dataset_name + ' Adjusted Passenger Count')
    ax.set_title("SARIMAX Validate Forecast With Test Data For " + dataset_name + " Adjusted Passenger Count")
    plt.legend(loc='upper left', fontsize=8)
    plt.show()
    y_forecasted = pred.predicted_mean
    y_truth = y['2015-01-01':]
    # calculate the root mean squared error
    rmse = sqrt(mean_squared_error(y_truth, y_forecasted))
    print("For " + dataset_name + " Dataset : " + 'Test RMSE: %.3f' % rmse)
    # calculate the mean absolute error
    mae = mean_absolute_error(y_truth, y_forecasted)
    print("For " + dataset_name + " Dataset : " + 'Test MAE: %.3f' % mae)
    # calculate the mean absolute percentage error
    mape = mean_absolute_percentage_error(y_truth, y_forecasted)
    print("For " + dataset_name + " Dataset : " + 'Test MAPE: %.3f' % mape)
    # forecast 60 months (5 years) ahead with confidence intervals
    pred_uc = results.get_forecast(steps=60)
    pred_ci = pred_uc.conf_int()
    ax = y.plot(label='observed', figsize=(14, 7))
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('Date')
    ax.set_title("SARIMAX Forecast for Next 5 Years: " + dataset_name)
    ax.set_ylabel(dataset_name + ' Adjusted Passenger Count')
    plt.legend()
    plt.show()
    model_summary.loc[('SARIMAX', 'RMSE'), dataset_name] = rmse
    model_summary.loc[('SARIMAX', 'MAE'), dataset_name] = mae
    model_summary.loc[('SARIMAX', 'MAPE'), dataset_name] = mape
plot_SARIMAX_FCST_model_summary(klm_ds, unitedAirline_param, klm_param_seasonal, 'KLM')
SARIMAX Model Summary for KLM
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.4099 0.486 -0.844 0.399 -1.362 0.542
ma.L1 0.2310 0.524 0.441 0.660 -0.797 1.259
ar.S.L12 -0.1689 0.187 -0.905 0.366 -0.535 0.197
ma.S.L12 -0.2909 0.227 -1.282 0.200 -0.736 0.154
sigma2 6.011e+05 6.16e+04 9.762 0.000 4.8e+05 7.22e+05
==============================================================================
For KLM Dataset : Test RMSE: 681.323
For KLM Dataset : Test MAE: 527.306
For KLM Dataset : Test MAPE: 0.073
plot_SARIMAX_FCST_model_summary(unitedAirline_ds, unitedAirline_param, unitedAirline_param_seasonal, 'United Airline')
SARIMAX Model Summary for United Airline
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.7284 1.720 0.423 0.672 -2.644 4.100
ma.L1 -0.7372 1.706 -0.432 0.666 -4.081 2.607
ar.S.L12 0.6499 0.146 4.463 0.000 0.365 0.935
ma.S.L12 -0.7920 0.173 -4.576 0.000 -1.131 -0.453
sigma2 5.009e+08 4.7e-10 1.07e+18 0.000 5.01e+08 5.01e+08
==============================================================================
For United Airline Dataset : Test RMSE: 18927.785
For United Airline Dataset : Test MAE: 15910.831
For United Airline Dataset : Test MAPE: 0.022
import datetime
def create_Prophet_df(df):
    # Keep only the period and passenger-count columns
    df = df[['Activity_Period_Start_Date', 'Adjusted Passenger Count']]
    # Sum duplicate activity periods, then regularise to month-start frequency
    df = df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index()
    df = df.set_index('Activity_Period_Start_Date')
    df = df['Adjusted Passenger Count'].resample('MS').mean()
    df = pd.DataFrame({'Activity_Period_Start_Date': df.index,
                       'Adjusted Passenger Count': df.values})
    return df
unitedAirline_prophet_dataset = create_Prophet_df(unitedAirline_ds)
us_prophet_dataset = create_Prophet_df(us_ds)
other_price_prophet_dataset = create_Prophet_df(other_price_ds)
international_prophet_dataset = create_Prophet_df(international_ds)
domestic_lowFarePriceCode_prophet_dataset = create_Prophet_df(domestic_lowFarePriceCode_ds)
americanAirline_prophet_dataset = create_Prophet_df(americanAirline_ds)
klm_prophet_dataset = create_Prophet_df(klm_ds)
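As a quick illustration of what `create_Prophet_df` does, the toy frame below (invented numbers, but the same column names as the SFO data) collapses duplicate activity periods and regularises the series to month-start frequency, ending with the `ds`/`y` column names Prophet expects:

```python
import pandas as pd

# Toy frame with a duplicated activity period (invented numbers).
df = pd.DataFrame({
    "Activity_Period_Start_Date": pd.to_datetime(
        ["2005-01-01", "2005-01-01", "2005-02-01"]),
    "Adjusted Passenger Count": [100, 50, 200],
})

monthly = (
    df.groupby("Activity_Period_Start_Date")["Adjusted Passenger Count"]
      .sum()                   # collapse duplicate rows per period: Jan -> 150
      .resample("MS")          # month-start frequency, filling any gaps
      .mean()
      .rename_axis("ds")
      .reset_index(name="y")   # Prophet's expected column names
)
print(monthly)  # two rows: 2005-01 -> 150.0, 2005-02 -> 200.0
```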
#!pip install pystan
#!pip install convertdate
#!pip install lunarcalendar
#!pip install lunardate
#!pip install holidays
#!pip install ephem
#!pip install prophet
from pandas import read_csv
from pandas import to_datetime
from pandas import DataFrame
from prophet import Prophet
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from matplotlib import pyplot
def plot_prophet_forecast(df, train_percent, dataset_name):
    # Prophet expects columns named 'ds' (datestamp) and 'y' (value)
    df.columns = ['ds', 'y']
    train_size = int(len(df) * train_percent / 100)
    test_size = len(df) - train_size
    train, future = df[0:train_size], df
    future = future.drop(['y'], axis=1)
    # Define the model and fit it on the training split
    model = Prophet()
    model.fit(train)
    # Predict over the full date range, including the hold-out period
    forecast = model.predict(future)
    fig1 = model.plot(forecast)
    ax = fig1.gca()
    ax.set_title("Plot of the forecast for " + dataset_name, size=12)
    fig1 = model.plot_components(forecast)
    ax = fig1.gca()
    ax.set_title("The components of the forecast for: " + dataset_name, size=8)
    # Calculate errors between expected and predicted hold-out values
    y_true = df['y'][-test_size:].values
    y_pred = forecast['yhat'][-test_size:].values
    mae = mean_absolute_error(y_true, y_pred)
    print('MAE: %.3f' % mae)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    print('MAPE: %.3f' % mape)
    pyplot.figure(figsize=(10, 5), dpi=100)
    pyplot.plot(train['y'], label='Training Data', linewidth=2)
    pyplot.plot([None for i in train['y']] + [x for x in y_true], 'b', label='Testing Data', linewidth=2)
    pyplot.plot([None for i in train['y']] + [x for x in y_pred], 'r--', label='Forecast Data', linewidth=2)
    pyplot.title("Prophet Forecast \n Observed Vs Predicted Plot for " + dataset_name)
    pyplot.legend(loc='upper left', fontsize=8)
    pyplot.show()
    # Record the test-set errors in the shared summary table
    model_summary.loc[('Prophet', 'MAE'), dataset_name], \
        model_summary.loc[('Prophet', 'MAPE'), dataset_name] = mae, mape
train_pctg_prophet = 79
plot_prophet_forecast(klm_prophet_dataset, train_pctg_prophet, "KLM")
MAE: 511.827
MAPE: 0.065
plot_prophet_forecast(unitedAirline_prophet_dataset, train_pctg_prophet, "United Airline")
MAE: 49676.393
MAPE: 0.071
This project focuses on developing models to forecast air travel demand. Time series analysis and forecasting were performed using several statistical models and visualizations. A crucial step is verifying each model's accuracy through error measurement, noting how far the forecasts deviate from the actual data. We evaluate the forecasting models primarily using MAPE (Mean Absolute Percentage Error). The table below shows the MAPE and MAE (and, where available, RMSE) for each forecasting model used to generate the projected data.
When the data values are large, e.g., Adjusted Passenger Counts ranging from 1,000 to 1,000,000, MAPE is the better choice, because a raw MAE can be misleading across scales. For example, for a data value of 10,000, an MAE of 500 corresponds to a MAPE of 5%, which is within good engineering tolerance; taken in isolation, however, the absolute value of 500 looks large and invites confusion.
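The contrast above can be reproduced with the same scikit-learn metric functions used throughout this project. In this sketch (invented numbers), every prediction is off by exactly 500 on values around 10,000, so MAE reports a scale-dependent 500 while MAPE reports a scale-free ~5%:

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Passenger counts around 10,000, each prediction off by 500 (invented numbers).
y_true = [10_000, 10_200, 9_800]
y_pred = [10_500, 9_700, 10_300]

mae = mean_absolute_error(y_true, y_pred)               # 500.0, in passenger units
mape = mean_absolute_percentage_error(y_true, y_pred)   # ~0.05, i.e. ~5%
print(f"MAE: {mae:.1f}  MAPE: {mape:.3f}")
```

Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction (0.05), not a percentage (5), which matches the MAPE values reported in the summary table below.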
# Define a function to highlight the minimum MAE value per dataset column
def highlight_min_mae(x):
    min_val = model_summary.loc[(slice(None), ["MAE"]), :].min()[x.name]
    return ['background-color: gold' if val == min_val else '' for val in x]

# Define a function to highlight the minimum MAPE value per dataset column
def highlight_min_mape(x):
    min_val = model_summary.loc[(slice(None), ["MAPE"]), :].min()[x.name]
    return ['background-color: yellow' if val == min_val else '' for val in x]
# Apply the function to the DataFrame
highlighted_df = model_summary.style.apply(highlight_min_mape)
# Display the highlighted DataFrame
highlighted_df
| Model | Metric | KLM | United Airline | US GEO Region | Other price category code | International GEO Summary | Domestic Low Fare |
|---|---|---|---|---|---|---|---|
| Linear Regression (BoxCox) | RMSE | 2685.737314 | 152895.433783 | 182783.021964 | 45061.787454 | 55470.076661 | 42233.340956 |
| | MAE | 2542.531729 | 122828.409131 | 140841.017055 | 32284.756073 | 46089.507968 | 30036.874134 |
| | MAPE | 0.329932 | 0.178330 | 0.097860 | 0.092683 | 0.103915 | 0.087194 |
| ARIMA | RMSE | 1089.246865 | 61138.821906 | 123301.060220 | 32222.489131 | 36794.801265 | 32082.360631 |
| | MAE | 936.700759 | 48985.628795 | 94641.184292 | 25387.216639 | 32860.195834 | 24273.419569 |
| | MAPE | 0.119040 | 0.066857 | 0.061137 | 0.067356 | 0.076280 | 0.064415 |
| Auto Regression | RMSE | 777.294130 | 34771.434079 | 47893.535137 | 26527.394827 | 22200.133503 | 26831.447808 |
| | MAE | 638.595839 | 28288.749407 | 39416.883434 | 20430.226713 | 18759.625438 | 21018.442733 |
| | MAPE | 0.079590 | 0.039734 | 0.026006 | 0.054614 | 0.042402 | 0.056434 |
| SARIMAX | RMSE | 681.323248 | 18927.784672 | 36049.018413 | 32408.617739 | 8366.976434 | 10514.497452 |
| | MAE | 527.306047 | 15910.830757 | 31936.121924 | 28725.828745 | 6840.929062 | 9355.119672 |
| | MAPE | 0.073190 | 0.021543 | 0.020538 | 0.017461 | 0.015193 | 0.023767 |
| Prophet | MAE | 511.826505 | 49676.393422 | 32746.651315 | 37290.179675 | 22127.973104 | 30407.522015 |
| | MAPE | 0.065351 | 0.070902 | 0.021498 | 0.022758 | 0.046498 | 0.082329 |
# Reshape the summary to long form and keep only the MAPE rows
model_summary_mape = model_summary.stack().reset_index()
model_summary_mape = model_summary_mape[model_summary_mape['level_1'] == 'MAPE']
# Rename the stacked index columns to meaningful names
model_summary_mape.rename(columns={"level_2": "Dataset", "level_0": "Forecasting_Model", 0: "MAPE"}, inplace=True)
sns.barplot(x='MAPE', y='Dataset', hue='Forecasting_Model', data=model_summary_mape, orient='h', palette='viridis')
Overall, the SARIMAX model appears best suited to most of the sub-datasets we tested.
To extend this study, the models could be run against sub-datasets for other regions and carriers, evaluating each model's fit across a larger number of datasets.
Data Source: https://data.world/data-society/air-traffic-passenger-data
Source Code: https://github.com/naeemasma/cscie83/tree/main